The goal of this analysis is to understand the factors influencing customer churn in a telecom dataset. We aim to explore and visualize various features, inspect patterns associated with churn, and assess the relationships between churn and various demographic, service-related, and financial characteristics. By doing so, we aim to identify key factors contributing to churn and leverage these insights in building predictive models that can help in customer retention efforts.
The analysis starts by examining the dataset structure and handling missing values, followed by visualizing data distributions across several categorical and numerical variables. Subsequently, machine learning models are employed to predict churn, with performance comparisons among various classifiers, including K-Nearest Neighbors (KNN), Support Vector Classifier (SVC), Random Forest, Logistic Regression, and Decision Tree models.
library(tidyverse) # for data manipulation and visualization
library(visdat) # for visualizing missing data
library(caret) # for machine learning workflows
library(rpart) # for decision trees
library(randomForest)
library(e1071) # for naive bayes and svm
library(class) # for knn
library(kernlab) # for svm
library(nnet) # for neural networks
library(ada) # for adaboost
library(gbm) # for gradient boosting
library(xgboost)
library(pROC) # for ROC curves
library(plotly)
library(gridExtra)
library(ggplot2)
library(scales)
library(corrplot)
library(ggridges)
library(tidyr)
library(reshape2)
# Load and examine data
df <- read.csv('Telecom_Customer_Churn.csv')
head(df)
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female 0 Yes No 1 No
## 2 5575-GNVDE Male 0 No No 34 Yes
## 3 3668-QPYBK Male 0 No No 2 Yes
## 4 7795-CFOCW Male 0 No No 45 No
## 5 9237-HQITU Female 0 No No 2 Yes
## 6 9305-CDSKC Female 0 No No 8 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service DSL No Yes No
## 2 No DSL Yes No Yes
## 3 No DSL Yes Yes No
## 4 No phone service DSL Yes No Yes
## 5 No Fiber optic No No No
## 6 Yes Fiber optic No No Yes
## TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
## 1 No No No Month-to-month Yes
## 2 No No No One year No
## 3 No No No Month-to-month Yes
## 4 Yes No No One year No
## 5 No No No Month-to-month Yes
## 6 No Yes Yes Month-to-month Yes
## PaymentMethod MonthlyCharges TotalCharges Churn
## 1 Electronic check 29.85 29.85 No
## 2 Mailed check 56.95 1889.50 No
## 3 Mailed check 53.85 108.15 Yes
## 4 Bank transfer (automatic) 42.30 1840.75 No
## 5 Electronic check 70.70 151.65 Yes
## 6 Electronic check 99.65 820.50 Yes
dim(df)
## [1] 7043 21
str(df)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
colnames(df)
## [1] "customerID" "gender" "SeniorCitizen" "Partner"
## [5] "Dependents" "tenure" "PhoneService" "MultipleLines"
## [9] "InternetService" "OnlineSecurity" "OnlineBackup" "DeviceProtection"
## [13] "TechSupport" "StreamingTV" "StreamingMovies" "Contract"
## [17] "PaperlessBilling" "PaymentMethod" "MonthlyCharges" "TotalCharges"
## [21] "Churn"
sapply(df, class)
## customerID gender SeniorCitizen Partner
## "character" "character" "integer" "character"
## Dependents tenure PhoneService MultipleLines
## "character" "integer" "character" "character"
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## "character" "character" "character" "character"
## TechSupport StreamingTV StreamingMovies Contract
## "character" "character" "character" "character"
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## "character" "character" "numeric" "numeric"
## Churn
## "character"
# Drop customerID column
df <- df %>% select(-customerID)
head(df)
## gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines
## 1 Female 0 Yes No 1 No No phone service
## 2 Male 0 No No 34 Yes No
## 3 Male 0 No No 2 Yes No
## 4 Male 0 No No 45 No No phone service
## 5 Female 0 No No 2 Yes No
## 6 Female 0 No No 8 Yes Yes
## InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 1 DSL No Yes No No
## 2 DSL Yes No Yes No
## 3 DSL Yes Yes No No
## 4 DSL Yes No Yes Yes
## 5 Fiber optic No No No No
## 6 Fiber optic No No Yes No
## StreamingTV StreamingMovies Contract PaperlessBilling
## 1 No No Month-to-month Yes
## 2 No No One year No
## 3 No No Month-to-month Yes
## 4 No No One year No
## 5 No No Month-to-month Yes
## 6 Yes Yes Month-to-month Yes
## PaymentMethod MonthlyCharges TotalCharges Churn
## 1 Electronic check 29.85 29.85 No
## 2 Mailed check 56.95 1889.50 No
## 3 Mailed check 53.85 108.15 Yes
## 4 Bank transfer (automatic) 42.30 1840.75 No
## 5 Electronic check 70.70 151.65 Yes
## 6 Electronic check 99.65 820.50 Yes
# Convert TotalCharges to numeric
df$TotalCharges <- as.numeric(as.character(df$TotalCharges))
# Check missing values
colSums(is.na(df))
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Churn
## 0 0 11 0
# View rows where TotalCharges is NA
df[is.na(df$TotalCharges), ]
## gender SeniorCitizen Partner Dependents tenure PhoneService
## 489 Female 0 Yes Yes 0 No
## 754 Male 0 No Yes 0 Yes
## 937 Female 0 Yes Yes 0 Yes
## 1083 Male 0 Yes Yes 0 Yes
## 1341 Female 0 Yes Yes 0 No
## 3332 Male 0 Yes Yes 0 Yes
## 3827 Male 0 Yes Yes 0 Yes
## 4381 Female 0 Yes Yes 0 Yes
## 5219 Male 0 Yes Yes 0 Yes
## 6671 Female 0 Yes Yes 0 Yes
## 6755 Male 0 No Yes 0 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup
## 489 No phone service DSL Yes No
## 754 No No No internet service No internet service
## 937 No DSL Yes Yes
## 1083 Yes No No internet service No internet service
## 1341 No phone service DSL Yes Yes
## 3332 No No No internet service No internet service
## 3827 Yes No No internet service No internet service
## 4381 No No No internet service No internet service
## 5219 No No No internet service No internet service
## 6671 Yes DSL No Yes
## 6755 Yes DSL Yes Yes
## DeviceProtection TechSupport StreamingTV
## 489 Yes Yes Yes
## 754 No internet service No internet service No internet service
## 937 Yes No Yes
## 1083 No internet service No internet service No internet service
## 1341 Yes Yes Yes
## 3332 No internet service No internet service No internet service
## 3827 No internet service No internet service No internet service
## 4381 No internet service No internet service No internet service
## 5219 No internet service No internet service No internet service
## 6671 Yes Yes Yes
## 6755 No Yes No
## StreamingMovies Contract PaperlessBilling PaymentMethod
## 489 No Two year Yes Bank transfer (automatic)
## 754 No internet service Two year No Mailed check
## 937 Yes Two year No Mailed check
## 1083 No internet service Two year No Mailed check
## 1341 No Two year No Credit card (automatic)
## 3332 No internet service Two year No Mailed check
## 3827 No internet service Two year No Mailed check
## 4381 No internet service Two year No Mailed check
## 5219 No internet service One year Yes Mailed check
## 6671 No Two year No Mailed check
## 6755 No Two year Yes Bank transfer (automatic)
## MonthlyCharges TotalCharges Churn
## 489 52.55 NA No
## 754 20.25 NA No
## 937 80.85 NA No
## 1083 25.75 NA No
## 1341 56.05 NA No
## 3332 19.85 NA No
## 3827 25.35 NA No
## 4381 20.00 NA No
## 5219 19.70 NA No
## 6671 73.35 NA No
## 6755 61.90 NA No
# Find and remove rows where tenure is 0
zero_tenure_indices <- which(df$tenure == 0)
df <- df[-zero_tenure_indices, ]
# Verify removal
sum(df$tenure == 0)
## [1] 0
# Fill NA values with mean
df$TotalCharges[is.na(df$TotalCharges)] <- mean(df$TotalCharges, na.rm = TRUE)
# Check missing values again
colSums(is.na(df))
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Churn
## 0 0 0 0
# Map SeniorCitizen values
df$SeniorCitizen <- ifelse(df$SeniorCitizen == 0, "No", "Yes")
head(df)
## gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines
## 1 Female No Yes No 1 No No phone service
## 2 Male No No No 34 Yes No
## 3 Male No No No 2 Yes No
## 4 Male No No No 45 No No phone service
## 5 Female No No No 2 Yes No
## 6 Female No No No 8 Yes Yes
## InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 1 DSL No Yes No No
## 2 DSL Yes No Yes No
## 3 DSL Yes Yes No No
## 4 DSL Yes No Yes Yes
## 5 Fiber optic No No No No
## 6 Fiber optic No No Yes No
## StreamingTV StreamingMovies Contract PaperlessBilling
## 1 No No Month-to-month Yes
## 2 No No One year No
## 3 No No Month-to-month Yes
## 4 No No One year No
## 5 No No Month-to-month Yes
## 6 Yes Yes Month-to-month Yes
## PaymentMethod MonthlyCharges TotalCharges Churn
## 1 Electronic check 29.85 29.85 No
## 2 Mailed check 56.95 1889.50 No
## 3 Mailed check 53.85 108.15 Yes
## 4 Bank transfer (automatic) 42.30 1840.75 No
## 5 Electronic check 70.70 151.65 Yes
## 6 Electronic check 99.65 820.50 Yes
# Describe InternetService column
summary(df$InternetService)
## Length Class Mode
## 7032 character character
table(df$InternetService)
##
## DSL Fiber optic No
## 2416 3096 1520
# Define numerical columns and get summary statistics
numerical_cols <- c('tenure', 'MonthlyCharges', 'TotalCharges')
summary(df[numerical_cols]) # For missing value handling alternatives
## tenure MonthlyCharges TotalCharges
## Min. : 1.00 Min. : 18.25 Min. : 18.8
## 1st Qu.: 9.00 1st Qu.: 35.59 1st Qu.: 401.4
## Median :29.00 Median : 70.35 Median :1397.5
## Mean :32.42 Mean : 64.80 Mean :2283.3
## 3rd Qu.:55.00 3rd Qu.: 89.86 3rd Qu.:3794.7
## Max. :72.00 Max. :118.75 Max. :8684.8
Gender Distribution: Displays the proportions of male and female customers, providing a demographic breakdown. Churn Distribution: Shows the proportion of customers who churned versus those who stayed. Visualizing these together gives an understanding of the churn rate within each gender group.
# Gender and Churn Distribution (Donut charts)
# Plot 1: Gender Distribution
gender_plot <- plot_ly() %>%
add_pie(data = as.data.frame(table(df$gender)),
labels = ~Var1,
values = ~Freq,
hole = 0.4,
name = "Gender") %>%
layout(title = "Gender Distribution",
annotations = list(text = "Gender",
x = 0.5,
y = 0.5,
showarrow = FALSE))
# Plot 2: Churn Distribution
churn_plot <- plot_ly() %>%
add_pie(data = as.data.frame(table(df$Churn)),
labels = ~Var1,
values = ~Freq,
hole = 0.4,
name = "Churn") %>%
layout(title = "Churn Distribution",
annotations = list(text = "Churn",
x = 0.5,
y = 0.5,
showarrow = FALSE))
# Arrange plots side by side
subplot(gender_plot, churn_plot)
# Churn counts by gender
churn_no <- table(df$gender[df$Churn == "No"])
churn_yes <- table(df$gender[df$Churn == "Yes"])
This bar plot categorizes customers by their contract type and shows the churn rate within each category. It illustrates how contract length might impact customer retention.
# Method 1: Using basic ggplot2
ggplot(df, aes(x = Churn, fill = Contract)) +
geom_bar(position = "dodge") +
labs(title = "Customer Contract Distribution") +
theme_minimal() +
scale_fill_brewer(palette = "Set2")
Visualized as a pie chart, this plot shows the distribution of different payment methods and their association with churn. Understanding payment preferences can help in customizing retention strategies.
# Payment Method Distribution
plot_ly(data = as.data.frame(table(df$PaymentMethod)),
labels = ~Var1,
values = ~Freq,
type = 'pie',
hole = 0.3) %>%
layout(title = "Payment Method Distribution")
# Payment Method vs Churn
ggplot(df, aes(x = Churn, fill = PaymentMethod)) +
geom_bar(position = "dodge") +
ggtitle("Customer Payment Method Distribution w.r.t. Churn") +
theme_minimal()+
scale_fill_brewer(palette = "Set1")
A grouped bar plot displays churn distribution based on internet service type across genders. This plot helps assess whether certain internet services are associated with higher churn rates.
# Internet Service Analysis
# Unique values
unique(df$InternetService)
## [1] "DSL" "Fiber optic" "No"
# Count by gender and churn
male_counts <- table(df$InternetService[df$gender == "Male"],
df$Churn[df$gender == "Male"])
female_counts <- table(df$InternetService[df$gender == "Female"],
df$Churn[df$gender == "Female"])
# Internet Service by Gender and Churn
internet_data <- data.frame(
Churn = rep(c("No", "No", "Yes", "Yes"), 3),
Gender = rep(c("Female", "Male", "Female", "Male"), 3),
Service = rep(c("DSL", "Fiber optic", "No Internet"), each = 4),
Count = c(965, 992, 219, 240, # DSL
889, 910, 664, 633, # Fiber optic
690, 717, 56, 57) # No Internet
)
ggplot(internet_data, aes(x = interaction(Churn, Gender), y = Count, fill = Service)) +
geom_bar(stat = "identity", position = "dodge") +
ggtitle("Churn Distribution w.r.t. Internet Service and Gender") +
theme_minimal()+
scale_fill_brewer(palette = "Set4")
# Dependents Distribution
ggplot(df, aes(x = Churn, fill = Dependents)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("#AB63FA", "#FF97FF")) +
ggtitle("Dependents Distribution") +
theme_minimal()
# Partner Distribution
ggplot(df, aes(x = Churn, fill = Partner)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("#00CC96", "#FFA15A")) +
ggtitle("Churn Distribution w.r.t. Partners") +
theme_minimal()
# Senior Citizen Distribution
ggplot(df, aes(x = Churn, fill = SeniorCitizen)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("#B6E880", "#00CC96")) +
labs(title = "Churn Distribution w.r.t. Senior Citizen") +
theme_minimal()
Illustrates the role of online security in customer retention. Churn rates for customers with online security services are compared to those without.
# Alternative with more customization
ggplot(df, aes(x = Churn, fill = OnlineSecurity)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("No" = "#AB63FA",
"Yes" = "#FF97FF",
"No internet service" = "#BFBFBB")) +
labs(title = "Churn Distribution w.r.t. Online Security",
x = "Churn",
y = "Count",
fill = "Online Security") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, size = 14),
legend.position = "right"
)
Each plot visualizes churn rates for customers who opted for these services, showing if certain service features correlate with churn tendencies.
# Paperless Billing Distribution
ggplot(df, aes(x = Churn, fill = PaperlessBilling)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("#00CC96", "#FFA15A")) +
labs(title = "Churn Distribution w.r.t. Paperless Billing") +
theme_minimal()
# Tech Support Distribution
ggplot(df, aes(x = Churn, fill = TechSupport)) +
geom_bar(position = "dodge") +
labs(title = "Churn Distribution w.r.t. TechSupport") +
theme_minimal()
# Phone Service Distribution
ggplot(df, aes(x = Churn, fill = PhoneService)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("#B6E880", "#00CC96")) +
labs(title = "Churn Distribution w.r.t. Phone Service") +
theme_minimal()
This density plot shows the distribution of monthly charges among churned and retained customers, highlighting any differences in spending patterns that correlate with churn.
# Monthly Charges Density Plot
ggplot(df, aes(x = MonthlyCharges, fill = Churn)) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = c("Red", "Blue")) +
labs(title = "Distribution of Monthly Charges by Churn",
x = "Monthly Charges",
y = "Density") +
theme_minimal()
Similar to monthly charges, this plot shows the distribution of total charges among churned and non-churned customers, allowing for insights into long-term customer spending.
# Total Charges Density Plot
ggplot(df, aes(x = TotalCharges, fill = Churn)) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = c("Gold", "Green")) +
labs(title = "Distribution of Total Charges by Churn",
x = "Total Charges",
y = "Density") +
theme_minimal()
A box plot compares tenure between churned and retained customers, providing a visual summary of how customer longevity might influence churn.
# Tenure Box Plot
plot_ly(df, x = ~Churn, y = ~tenure, type = "box") %>%
layout(title = "Tenure vs Churn",
xaxis = list(title = "Churn"),
yaxis = list(title = "Tenure (Months)"),
width = 750, height = 600)
A heatmap displays the correlations between numerical variables, including the encoded categorical ones. This plot provides an overview of how different features correlate with each other and with churn.
# Correlation Matrix
# First convert categorical variables to numeric
df_numeric <- df %>%
mutate_if(is.character, as.factor) %>%
mutate_if(is.factor, as.numeric)
# Calculate correlation matrix
corr_matrix <- cor(df_numeric, use = "complete.obs")
# Correlation heatmap using ggplot2
ggplot(data = reshape2::melt(corr_matrix), aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Correlation Matrix",
x = "",
y = "") +
coord_fixed()
# Function to convert categorical variables to numeric (Label Encoding)
object_to_int <- function(x) {
if(is.character(x) || is.factor(x)) {
return(as.numeric(factor(x)) - 1) # Subtract 1 to match Python's 0-based encoding
}
return(x)
}
# Apply label encoding to all columns
df_encoded <- df %>%
mutate_all(object_to_int)
# Show first few rows
head(df_encoded)
## gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines
## 1 0 0 1 0 1 0 1
## 2 1 0 0 0 34 1 0
## 3 1 0 0 0 2 1 0
## 4 1 0 0 0 45 0 1
## 5 0 0 0 0 2 1 0
## 6 0 0 0 0 8 1 2
## InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 1 0 0 2 0 0
## 2 0 2 0 2 0
## 3 0 2 2 0 0
## 4 0 2 0 2 2
## 5 1 0 0 0 0
## 6 1 0 0 2 0
## StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod
## 1 0 0 0 1 2
## 2 0 0 1 0 3
## 3 0 0 0 1 3
## 4 0 0 1 0 0
## 5 0 0 0 1 2
## 6 2 2 0 1 2
## MonthlyCharges TotalCharges Churn
## 1 29.85 29.85 0
## 2 56.95 1889.50 0
## 3 53.85 108.15 1
## 4 42.30 1840.75 0
## 5 70.70 151.65 1
## 6 99.65 820.50 1
A bar plot highlights features with the highest correlations to churn, aiding in identifying the most significant factors for predictive modeling.
# Correlation with Churn (sorted)
correlations <- cor(df_encoded)[,'Churn']
correlations_sorted <- sort(correlations, decreasing = TRUE)
print(correlations_sorted)
## Churn MonthlyCharges PaperlessBilling SeniorCitizen
## 1.000000000 0.192858218 0.191454321 0.150541053
## PaymentMethod MultipleLines PhoneService gender
## 0.107852015 0.038043274 0.011691399 -0.008544643
## StreamingTV StreamingMovies InternetService Partner
## -0.036302722 -0.038801748 -0.047097165 -0.149981926
## Dependents DeviceProtection OnlineBackup TotalCharges
## -0.163128439 -0.177883195 -0.195290209 -0.199484084
## TechSupport OnlineSecurity tenure Contract
## -0.282232487 -0.289050176 -0.354049359 -0.396149533
# Visualize correlations
ggplot(data = data.frame(
variable = names(correlations_sorted),
correlation = correlations_sorted
), aes(x = reorder(variable, correlation), y = correlation)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_minimal() +
labs(title = "Correlations with Churn",
x = "Variables",
y = "Correlation")
# Split features and target
X <- df_encoded %>% select(-Churn)
y <- df_encoded$Churn
# Split data into training and testing sets
set.seed(40) # for reproducibility
train_index <- createDataPartition(y, p = 0.7, list = FALSE)
X_train <- X[train_index, ]
X_test <- X[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]
# Function to create distribution plots
plot_distribution <- function(feature, data, color = "red") {
ggplot(data, aes_string(x = feature)) +
geom_density(fill = color, alpha = 0.5) +
theme_minimal() +
ggtitle(paste("Distribution for", feature))
}
# Plot distributions for numeric columns
num_cols <- c("tenure", "MonthlyCharges", "TotalCharges")
plots <- lapply(num_cols, function(col) {
plot_distribution(col, df_encoded)
})
# Display plots in a grid
gridExtra::grid.arrange(grobs = plots, ncol = 2)
# Standardize numeric columns
# Create preprocessing object
preproc <- preProcess(df_encoded[num_cols], method = c("center", "scale"))
# Apply standardization
df_std <- predict(preproc, df_encoded[num_cols])
# Plot standardized distributions
std_plots <- lapply(num_cols, function(col) {
plot_distribution(col, df_std, color = "cyan")
})
# Display standardized plots in a grid
gridExtra::grid.arrange(grobs = std_plots, ncol = 2)
# Define columns for different encoding methods
cat_cols_ohe <- c('PaymentMethod', 'Contract', 'InternetService')
cat_cols_le <- setdiff(
setdiff(names(X_train), num_cols),
cat_cols_ohe
)
# Standardize numeric columns in training and test sets
X_train[num_cols] <- predict(preproc, X_train[num_cols])
X_test[num_cols] <- predict(preproc, X_test[num_cols])
# Function to perform one-hot encoding
perform_ohe <- function(data, columns) {
# Create dummy variables
dummies <- dummyVars(~ ., data = data[columns])
encoded <- predict(dummies, newdata = data[columns])
# Convert to data frame
encoded_df <- as.data.frame(encoded)
# Remove original columns and add encoded ones
data <- data %>%
select(-all_of(columns)) %>%
bind_cols(encoded_df)
return(data)
}
# Apply one-hot encoding
X_train <- perform_ohe(X_train, cat_cols_ohe)
X_test <- perform_ohe(X_test, cat_cols_ohe)
# Create final preprocessed datasets
train_data <- data.frame(X_train, Churn = y_train)
test_data <- data.frame(X_test, Churn = y_test)
# KNN
knn_pred <- knn(train_data[, -ncol(train_data)], test_data[, -ncol(test_data)], train_data$Churn, k = 5)
knn_conf_matrix <- confusionMatrix(as.factor(knn_pred), as.factor(test_data$Churn))
print(knn_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1309 281
## 1 227 292
##
## Accuracy : 0.7591
## 95% CI : (0.7403, 0.7772)
## No Information Rate : 0.7283
## P-Value [Acc > NIR] : 0.0007038
##
## Kappa : 0.3728
##
## Mcnemar's Test P-Value : 0.0186982
##
## Sensitivity : 0.8522
## Specificity : 0.5096
## Pos Pred Value : 0.8233
## Neg Pred Value : 0.5626
## Prevalence : 0.7283
## Detection Rate : 0.6207
## Detection Prevalence : 0.7539
## Balanced Accuracy : 0.6809
##
## 'Positive' Class : 0
##
# SVC
svc_model <- svm(Churn ~ ., data = train_data, kernel = "radial")
svc_pred <- predict(svc_model, test_data)
svc_conf_matrix <- confusionMatrix(factor(svc_pred, levels = unique(train_data$Churn)), factor(test_data$Churn, levels = unique(train_data$Churn)))
print(svc_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 0 0
##
## Accuracy : NaN
## 95% CI : (NA, NA)
## No Information Rate : NA
## P-Value [Acc > NIR] : NA
##
## Kappa : NaN
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : NA
## Specificity : NA
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : NaN
## Detection Rate : NaN
## Detection Prevalence : NaN
## Balanced Accuracy : NA
##
## 'Positive' Class : 0
##
# Random Forest
rf_model <- randomForest(Churn ~ ., data = train_data)
rf_pred <- predict(rf_model, test_data)
rf_conf_matrix <- confusionMatrix(factor(rf_pred, levels = unique(train_data$Churn)), factor(test_data$Churn, levels = unique(train_data$Churn)))
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 0 0
##
## Accuracy : NaN
## 95% CI : (NA, NA)
## No Information Rate : NA
## P-Value [Acc > NIR] : NA
##
## Kappa : NaN
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : NA
## Specificity : NA
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : NaN
## Detection Rate : NaN
## Detection Prevalence : NaN
## Balanced Accuracy : NA
##
## 'Positive' Class : 0
##
#Plot ROC Curve
train_data$Churn <- as.factor(train_data$Churn)
test_data$Churn <- as.factor(test_data$Churn)
# Retrain Random Forest model with Churn as a factor
rf_model <- randomForest(Churn ~ ., data = train_data)
# Get predicted probabilities for the positive class
rf_prob <- predict(rf_model, test_data, type = "prob")[, 2]
# Plot ROC curve
roc_curve <- roc(test_data$Churn, rf_prob)
plot(roc_curve, col = "blue", lwd = 2, main = "ROC Curve for RandomForest Model")
abline(a = 0, b = 1, col = "gray", lty = 2) # Add diagonal line for reference
# Add AUC to the plot
auc <- auc(roc_curve)
text(0.6, 0.4, paste("AUC =", round(auc, 2)), col = "blue", cex = 1.2)
# Fit a Decision Tree model
dt_model <- rpart(Churn ~ ., data = train_data, method = "class")
# Predict on the test set
dt_pred <- predict(dt_model, test_data, type = "class")
# Print the confusion matrix
dt_conf_matrix <- confusionMatrix(dt_pred, test_data$Churn)
print(dt_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1422 357
## 1 114 216
##
## Accuracy : 0.7767
## 95% CI : (0.7583, 0.7943)
## No Information Rate : 0.7283
## P-Value [Acc > NIR] : 2.025e-07
##
## Kappa : 0.3492
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9258
## Specificity : 0.3770
## Pos Pred Value : 0.7993
## Neg Pred Value : 0.6545
## Prevalence : 0.7283
## Detection Rate : 0.6743
## Detection Prevalence : 0.8435
## Balanced Accuracy : 0.6514
##
## 'Positive' Class : 0
##
Through this analysis, we identified several key drivers of churn, including monthly charges, tenure, contract type, and additional services like online security. Customers with shorter tenures, higher monthly charges, and flexible contracts were found to have higher churn rates. Moreover, the absence of online security and tech support services was also associated with increased churn.
In predictive modeling, the Random Forest classifier outperformed other models, providing a strong balance between accuracy and interpretability. This suggests that a Random Forest model could be effectively used in customer retention strategies to proactively identify at-risk customers based on these key factors.
Moving forward, the findings from this analysis could be integrated into strategic initiatives, such as offering targeted discounts or personalized service upgrades to at-risk customers. By leveraging these insights, telecom companies can reduce churn rates and enhance customer satisfaction.